This report on the answers of the exercises outline Activity 2 on Data transformation all done in R Studio.
Let’s load the prerequisite libraries, we have assumes that packages has been installed.
library("nycflights13")
library("tidyverse")
## -- Attaching packages ------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Checking data
?flights
## starting httpd help server ... done
In order to find flights with an arrival delay of two or more ours, we should convert hours into minutes Since the arr_delay variable is measured in minutes. Therefore, the variable we use to check will be 120 or more minutes.
filter(flights, arr_delay >= 120)
First we need to identify where flight have IAH or HOU as the destinations.
THere are two ways to filter it. It shows same answer but the first is more concise.
First:
filter(flights,dest %in% c("IAH", "HOU"))
Noticed that we used %in% to be concised.
Second, we explicitly define dest to be ÍAH or HOU.
filter(flights, dest == "IAH" | dest == "HOU")
Let’s check first what are the airline code for United, American or Delta.
airlines
We can now see that UA is for United Airlines, AA is for American Airlines, and DL is for Delta.
filter(flights, carrier %in% c("AA", "DL", "UA"))
MOnths representation are in number. July is 7, August is 8 and soon.
filter(flights, month >= 7, month <= 9)
using some mathematical operators, we can run:
filter(flights, arr_delay > 120, dep_delay <= 0)
Let’s analyze:
If a flight was delayed by at least an hour, then dep_delay >= 60. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning dep_delay == arr_delay, or alternatively, dep_delay - arr_delay == 0. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as dep_delay - arr_delay > 30.
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
How does midnight is depicted here? This is a riddle so we need to check summary.
summary(flights$dep_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 907 1401 1349 1744 2400 8255
Viola midnight is 2400, not 0 as first thought.
filter(flights, dep_time <= 600 | dep_time == 2400)
The expression between(x, left, right) is equivalent to x >= left & x <= right.
Of the answers in the previous question, we could simplify the statement of departed in summer (month >= 7 & month <= 9) using the between() function.
filter(flights, between(month, 7, 9))
Find the rows of flights with a missing departure time (dep_time) using the is.na() function.
filter(flights, is.na(dep_time))
IT looks like these missing departure times are those canceled flights.
NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? ( NA * 0 is a tricky counterexample!)
Let’s check:
NA ^ 0
## [1] 1
NA ^ 0 == 1 since for all numeric values x0=1 .
NA | TRUE
## [1] TRUE
NA | TRUE is TRUE because anything or TRUE is TRUE. By using OR operator, if the missing value were TRUE, then TRUE | TRUE == TRUE, and if the missing value was FALSE, then FALSE | TRUE == TRUE.
NA & FALSE
## [1] FALSE
The value of NA & FALSE is FALSE because anything and FALSE is always FALSE. If the missing value were TRUE, then TRUE & FALSE == FALSE, and if the missing value was FALSE, then FALSE & FALSE == FALSE.
NA | FALSE
## [1] NA
For NA | FALSE, the value is unknown since TRUE | FALSE == TRUE, but FALSE | FALSE == FALSE.
NA & FALSE
## [1] FALSE
For NA & TRUE, the value is unknown since FALSE & TRUE== FALSE, but TRUE & TRUE == TRUE.
NA * 0
## [1] NA
Since x∗0=0 for all finite numbers we might expect NA * 0 == 0, but that’s not the case. The reason that NA * 0 != 0 is that 0×∞ and 0×−∞
are undefined. R represents undefined results as NaN, which is an abbreviation of “not a number”.
Inf * 0
## [1] NaN
-Inf * 0
## [1] NaN
Using arrange() in default call,
arrange(flights, dep_time)
Let’s look where are the NAs, it looks like they are the end of the list.
arrange(flights, dep_time) %>%
tail()
So let;s use is.na() to put all these NAs at the start of list for sorting.
arrange(flights, desc(is.na(dep_time)), dep_time)
The flights will first be sorted by desc(is.na(dep_time)) because desc(is.na(dep_time)) function is either TRUE when dep_time is missing, or FALSE, when it is not, the rows with missing values of dep_time will come first, since TRUE > FALSE.
To find the most delayed flights, we will sort the table using departure delay, dep_delay, in descending order.
arrange(flights, desc(dep_delay))
The most delayed flights were scheduled to leave on January 09, 2013 09:00. Noticed that the departure time is given as 641, which is 641AM the next day, so around 21 hours and 41 minutes delayed. Curious to see what airlines are the most delayed - it was HA 51, JFK to HNL,
For the earliest departing flight, it can be found by sorting dep_delay in ascending order.
arrange(flights, dep_delay)
The earliest flights were noted to depart December 7, 2013 with dep_time 2040, a 43 minutes early than its scheduled dept time of 2123.
I think to have the fastest flights, it should have the shortest air_time. Adding head() is to ensure it will be visible.
head(arrange(flights, (air_time)))
Fastest flights here are those with 20 minute flights fromEWR to BDL.
Or, we can compute to get the ground speed by dividing distance over air_time.
arrange(flights, desc(distance/air_time))
Based on ground speed, DL 1499 from LGA to ATL is the fastest.
Sort the flights by the distance column in descending order to get the farthest flight.
arrange(flights, desc(distance))
In the data, it looks like the flight going to HNL from JFK is the farthest.
arrange(flights, air_time)
Flight UV 4368 from EWR and BDL is the shortest.
In selecting columns, I think we have few ways to do it.
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, 4, 6, 7, 9)
This works, but is not good practice for two reasons. First, the column location of variables may change, resulting in code that may continue to run without error, but produce the wrong answer. Second code is obfuscated, since it is not clear from the code which variables are being selected. What variable does column 6 correspond to? I just wrote the code, and I’ve already forgotten.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
This is useful because the names of the variables can be stored in a variable and passed to all_of() or any_of().
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(variables))
These two functions replace the deprecated function one_of().
Selecting the variables by matching the start of their names using starts_with().
select(flights, starts_with("dep_"), starts_with("arr_"))
Selecting the variables using regular expressions with matches(). Regular expressions provide a flexible way to match string patterns and are discussed in the Strings chapter.
select(flights, matches("^(dep|arr)_(time|delay)$"))
Specify the names of the variables with a character vector and use the bang-bang operator (!!).
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!variables)
This and the following answers use the features of tidy evaluation not covered in R4DS but covered in the Programming with dplyr vignette.
Specify the names of the variables in a character or list vector and use the bang-bang-bang operator.
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!!variables)
Specify the unquoted names of the variables in a list using syms() and use the bang-bang-bang operator.
variables <- syms(c("dep_time", "dep_delay", "arr_time", "arr_delay"))
select(flights, !!!variables)
There are Some things that don’t work:
Matching the ends of their names using ends_with() since this will incorrectly include other variables. For example,
select(flights, ends_with("arr_time"), ends_with("dep_time"))
Matching the names using contains() since there is not a pattern that can include all these variables without incorrectly including others.
select(flights, contains("_time"), contains("arr_"))
select(flights, dep_time, arr_time, carrier, dep_time, dep_time)
It looks like it ignores the duplicate variables in displaying. We need not see any error or warnings. Just it let it pass.
select(flights, everything())
This behavior is useful because it means that we can use select() with everything() in order to easily change the order of columns without having to specify the names of all the columns.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
Per documentation,
The one_of() function selects variables with a character vector rather than unquoted variable name arguments. This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.
select(flights, one_of(vars))
There are also other similar functions: all_of() and any_of()
select(flights, all_of(vars))
select(flights, any_of(vars))
These functions differ in their strictness. The function all_of() will raise an error if one of the variable names is not present, while any_of() will ignore it.
The select() function can now accept the name of a vector containing the variable names you wish to select:
select(flights, vars)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(vars)` instead of `vars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
However there is a problem with the previous code. The name vars could refer to a column named vars in flights or a different variable named vars. What th code does will depend on whether or not vars is a column in flights. If vars was a column in flights, then that code would only select the vars column. For example:
flights <- mutate(flights, vars = 1)
select(flights, !!!vars)
This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.
select(flights, contains("TIME"))
To get the departure times in the number of minutes, divide dep_time by 100 to get the hours since midnight and multiply by 60 and add the remainder of dep_time divided by 100. For example, 1504 represents 15:04 (or 3:04 PM), which is 904 minutes after midnight. To generalize this approach, we need a way to split out the hour-digits from the minute-digits. Dividing by 100 and discarding the remainder using the integer division operator, %/% gives us the following.
1504 %/% 100
## [1] 15
Instead of %/% could also use / along with trunc() or floor(), but round() would not work. To get the minutes, instead of discarding the remainder of the division by 100, we only want the remainder. So we use the modulo operator, %%, discussed in the Other Useful Functions section.
1504 %% 100
## [1] 4
Now, we can combine the hours (multiplied by 60 to convert them to minutes) and minutes to get the number of minutes after midnight.
1504 %/% 100 * 60 + 1504 %% 100
## [1] 904
There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same.
Now we will put it all together. The following code creates a new data frame flights_times with columns dep_time_mins and sched_dep_time_mins. These columns convert dep_time and sched_dep_time, respectively, to minutes since midnight.
flights_times <- mutate(flights,
dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_mins = (sched_dep_time %/% 100 * 60 +
sched_dep_time %% 100) %% 1440
)
# view only relevant columns
select(
flights_times, dep_time, dep_time_mins, sched_dep_time,
sched_dep_time_mins
)
Looking ahead to the Functions chapter, this is precisely the sort of situation in which it would make sense to write a function to avoid copying and pasting code. We could define a function time2mins(), which converts a vector of times in from the format used in flights to minutes since midnight.
time2mins <- function(x) {
(x %/% 100 * 60 + x %% 100) %% 1440
}
Using time2mins, the previous code simplifies to the following.
flights_times <- mutate(flights,
dep_time_mins = time2mins(dep_time),
sched_dep_time_mins = time2mins(sched_dep_time)
)
# show only the relevant columns
select(
flights_times, dep_time, dep_time_mins, sched_dep_time,
sched_dep_time_mins
)
I expect that air_time is the difference between the arrival (arr_time) and departure times (dep_time). In other words, air_time = arr_time - dep_time.
To check that this relationship, I’ll first need to convert the times to a form more amenable to arithmetic operations using the same calculations as the previous exercise.
flights_airtime <-
mutate(flights,
dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
air_time_diff = air_time - arr_time + dep_time
)
So, does air_time = arr_time - dep_time? If so, there should be no flights with non-zero values of air_time_diff.
nrow(filter(flights_airtime, air_time_diff != 0))
## [1] 327150
It turns out that there are many flights for which air_time != arr_time - dep_time. Other than data errors, I can think of two reasons why air_time would not equal arr_time - dep_time.
The flight passes midnight, so arr_time < dep_time. In these cases, the difference in airtime should be by 24 hours (1,440 minutes).
The flight crosses time zones, and the total air time will be off by hours (multiples of 60). All flights in flights departed from New York City and are domestic flights in the US. This means that flights will all be to the same or more westerly time zones. Given the time-zones in the US, the differences due to time-zone should be 60 minutes (Central) 120 minutes (Mountain), 180 minutes (Pacific), 240 minutes (Alaska), or 300 minutes (Hawaii).
Both of these explanations have clear patterns that I would expect to see if they were true. In particular, in both cases, since time-zones and crossing midnight only affects the hour part of the time, all values of air_time_diff should be divisible by 60. I’ll visually check this hypothesis by plotting the distribution of air_time_diff. If those two explanations are correct, distribution of air_time_diff should comprise only spikes at multiples of 60.
ggplot(flights_airtime, aes(x = air_time_diff)) +
geom_histogram(binwidth = 1)
## Warning: Removed 9430 rows containing non-finite values (stat_bin).
This is not the case. While, the distribution of air_time_diff has modes at multiples of 60 as hypothesized, it shows that there are many flights in which the difference between air time and local arrival and departure times is not divisible by 60.
Let’s also look at flights with Los Angeles as a destination. The discrepancy should be 180 minutes.
ggplot(filter(flights_airtime, dest == "LAX"), aes(x = air_time_diff)) +
geom_histogram(binwidth = 1)
## Warning: Removed 148 rows containing non-finite values (stat_bin).
#> Warning: Removed 148 rows containing non-finite values (stat_bin).
To fix these time-zone issues, I would want to convert all the times to a date-time to handle overnight flights, and from local time to a common time zone, most likely UTC, to handle flights crossing time-zones. The tzone column of nycflights13::airports gives the time-zone of each airport. See the “Dates and Times” for an introduction on working with date and time data.
But that still leaves the other differences unexplained. So what else might be going on? There seem to be too many problems for this to be data entry problems, so I’m probably missing something.
I would expect the departure delay (dep_delay) to be equal to the difference between scheduled departure time (sched_dep_time), and actual departure time (dep_time), dep_time - sched_dep_time = dep_delay.
As with the previous question, the first step is to convert all times to the number of minutes since midnight. The column, dep_delay_diff, is the difference between the column, dep_delay, and departure delay calculated directly from the scheduled and actual departure times.
flights_deptime <-
mutate(flights,
dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_min = (sched_dep_time %/% 100 * 60 +
sched_dep_time %% 100) %% 1440,
dep_delay_diff = dep_delay - dep_time_min + sched_dep_time_min
)
Does dep_delay_diff equal zero for all rows?
filter(flights_deptime, dep_delay_diff != 0)
No. Unlike the last question, time zones are not an issue since we are only considering departure times.3 However, the discrepancies could be because a flight was scheduled to depart before midnight, but was delayed after midnight. All of these discrepancies are exactly equal to 1440 (24 hours), and the flights with these discrepancies were scheduled to depart later in the day.
ggplot(
filter(flights_deptime, dep_delay_diff > 0),
aes(y = sched_dep_time_min, x = dep_delay_diff)
) +
geom_point()
Thus the only cases in which the departure delay is not equal to the difference in scheduled departure and actual departure times is due to a quirk in how these columns were stored.
The dplyr package provides multiple functions for ranking, which differ in how they handle tied values: row_number(), min_rank(), dense_rank(). To see how they work, let’s create a data frame with duplicate values in a vector and see how ranking functions handle ties.
rankme <- tibble(
x = c(10, 5, 1, 5, 5)
)
rankme <- mutate(rankme,
x_row_number = row_number(x),
x_min_rank = min_rank(x),
x_dense_rank = dense_rank(x)
)
arrange(rankme, x)
The function row_number() assigns each element a unique value. The result is equivalent to the index (or row) number of each element after sorting the vector, hence its name.
Themin_rank() and dense_rank() assign tied values the same rank, but differ in how they assign values to the next rank. For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one. To see the difference between dense_rank() and min_rank() compare the value of rankme\(x_min_rank and rankme\)x_dense_rank for x = 10.
If I had to choose one for presenting rankings to someone else, I would use min_rank() since its results correspond to the most common usage of rankings in sports or other competitions. In the code below, I use all three functions, but since there are no ties in the top 10 flights, the results don’t differ.
flights_delayed <- mutate(flights,
dep_delay_min_rank = min_rank(desc(dep_delay)),
dep_delay_row_number = row_number(desc(dep_delay)),
dep_delay_dense_rank = dense_rank(desc(dep_delay))
)
flights_delayed <- filter(flights_delayed,
!(dep_delay_min_rank > 10 | dep_delay_row_number > 10 |
dep_delay_dense_rank > 10))
flights_delayed <- arrange(flights_delayed, dep_delay_min_rank)
print(select(flights_delayed, month, day, carrier, flight, dep_delay,
dep_delay_min_rank, dep_delay_row_number, dep_delay_dense_rank),
n = Inf)
## # A tibble: 10 x 8
## month day carrier flight dep_delay dep_delay_min_r~ dep_delay_row_n~
## <int> <int> <chr> <int> <dbl> <int> <int>
## 1 1 9 HA 51 1301 1 1
## 2 6 15 MQ 3535 1137 2 2
## 3 1 10 MQ 3695 1126 3 3
## 4 9 20 AA 177 1014 4 4
## 5 7 22 MQ 3075 1005 5 5
## 6 4 10 DL 2391 960 6 6
## 7 3 17 DL 2119 911 7 7
## 8 6 27 DL 2007 899 8 8
## 9 7 22 DL 2047 898 9 9
## 10 12 5 AA 172 896 10 10
## # ... with 1 more variable: dep_delay_dense_rank <int>
In addition to the functions covered here, the rank() function provides several more ways of ranking elements.
There are other ways to solve this problem that do not using ranking functions. To select the top 10, sort values with arrange() and select the top values with slice:
flights_delayed2 <- arrange(flights, desc(dep_delay))
flights_delayed2 <- slice(flights_delayed2, 1:10)
select(flights_delayed2, month, day, carrier, flight, dep_delay)
Alternatively, we could use the top_n().
flights_delayed3 <- top_n(flights, 10, dep_delay)
flights_delayed3 <- arrange(flights_delayed3, desc(dep_delay))
select(flights_delayed3, month, day, carrier, flight, dep_delay)
The previous two approaches will always select 10 rows even if there are tied values. Ranking functions provide more control over how tied values are handled. Those approaches will provide the 10 rows with the largest values of dep_delay, while ranking functions can provide all rows with the 10 largest values of dep_delay. If there are no ties, these approaches are equivalent. If there are ties, then which is more appropriate depends on the use.
The code given in the question returns the following.
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5 7 9 8 10 12 11
This is equivalent to the following.
c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 1 + 10)
## [1] 2 4 6 5 7 9 8 10 12 11
When adding two vectors, R recycles the shorter vector’s values to create a vector of the same length as the longer vector. The code also raises a warning that the shorter vector is not a multiple of the longer vector. A warning is raised since when this occurs, it is often unintended and may be a bug.
All trigonometric functions are all described in a single help page, named Trig. You can open the documentation for these functions with ?Trig or by using ? with any of the following functions, for example:?sin.
R provides functions for the three primary trigonometric functions: sine (sin()), cosine (cos()), and tangent (tan()). The input angles to all these functions are in radians.
x <- seq(-3, 7, by = 1 / 2)
sin(base::pi * x)
## [1] -3.673819e-16 -1.000000e+00 2.449213e-16 1.000000e+00 -1.224606e-16
## [6] -1.000000e+00 0.000000e+00 1.000000e+00 1.224606e-16 -1.000000e+00
## [11] -2.449213e-16 1.000000e+00 3.673819e-16 -1.000000e+00 -4.898425e-16
## [16] 1.000000e+00 6.123032e-16 -1.000000e+00 -7.347638e-16 1.000000e+00
## [21] 8.572244e-16
cos(base::pi * x)
## [1] -1.000000e+00 3.061516e-16 1.000000e+00 -1.836910e-16 -1.000000e+00
## [6] 6.123032e-17 1.000000e+00 6.123032e-17 -1.000000e+00 -1.836910e-16
## [11] 1.000000e+00 3.061516e-16 -1.000000e+00 -4.286122e-16 1.000000e+00
## [16] 5.510729e-16 -1.000000e+00 -2.449890e-15 1.000000e+00 -9.803627e-16
## [21] -1.000000e+00
tan(base::pi * x)
## [1] 3.673940e-16 -3.266248e+15 2.449294e-16 -5.443746e+15 1.224647e-16
## [6] -1.633124e+16 0.000000e+00 1.633124e+16 -1.224647e-16 5.443746e+15
## [11] -2.449294e-16 3.266248e+15 -3.673940e-16 2.333034e+15 -4.898587e-16
## [16] 1.814582e+15 -6.123234e-16 4.081778e+14 -7.347881e-16 -1.020058e+15
## [21] -8.572528e-16
In the previous code, I used the variable pi. R provides the variable pi which is set to the value of the mathematical constant π.
pi
## [1] 3.141593
#> [1]
Although R provides the pi variable, there is nothing preventing a user from changing its value. For example, I could redefine pi to 3.14 or any other value.
pi <- 3.14
pi
## [1] 3.14
pi <- "Mango"
pi
## [1] "Mango"
For that reason, if you are using the builtin pi variable in computations and are paranoid, you may want to always reference it as base::pi.
base::pi
## [1] 3.141593
In the previous code block, since the angles were in radians, I wrote them as π times some number. Since it is often easier to write radians multiple of π, R provides some convenience functions that do that. The function sinpi(x), is equivalent to sin(pi * x). The functions cospi() and tanpi() are similarly defined for the sin and tan functions, respectively.
sinpi(x)
## [1] 0 -1 0 1 0 -1 0 1 0 -1 0 1 0 -1 0 1 0 -1 0 1 0
cospi(x)
## [1] -1 0 1 0 -1 0 1 0 -1 0 1 0 -1 0 1 0 -1 0 1 0 -1
tanpi(x)
## Warning in tanpi(x): NaNs produced
## [1] 0 NaN 0 NaN 0 NaN 0 NaN 0 NaN 0 NaN 0 NaN 0 NaN 0 NaN 0
## [20] NaN 0
R provides the function arc-cosine (acos()), arc-sine (asin()), and arc-tangent (atan()).
x <- seq(-1, 1, by = 1 / 4)
acos(x)
## [1] 3.1415927 2.4188584 2.0943951 1.8234766 1.5707963 1.3181161 1.0471976
## [8] 0.7227342 0.0000000
asin(x)
## [1] -1.5707963 -0.8480621 -0.5235988 -0.2526803 0.0000000 0.2526803 0.5235988
## [8] 0.8480621 1.5707963
atan(x)
## [1] -0.7853982 -0.6435011 -0.4636476 -0.2449787 0.0000000 0.2449787 0.4636476
## [8] 0.6435011 0.7853982
Finally, R provides the function atan2(). Calling atan2(y, x) returns the angle between the x-axis and the vector from (0,0) to (x, y).
atan2(c(1, 0, -1, 0), c(0, 1, 0, -1))
## [1] 1.570796 0.000000 -1.570796 3.141593
What this question gets at is a fundamental question of data analysis: the cost function. As analysts, the reason we are interested in flight delay because it is costly to passengers. But it is worth thinking carefully about how it is costly and use that information in ranking and measuring these scenarios.
In many scenarios, arrival delay is more important. In most cases, being arriving late is more costly to the passenger since it could disrupt the next stages of their travel, such as connecting flights or scheduled meetings. If a departure is delayed without affecting the arrival time, this delay will not have those affects plans nor does it affect the total time spent traveling. This delay could be beneficial, if less time is spent in the cramped confines of the airplane itself, or a negative, if that delayed time is still spent in the cramped confines of the airplane on the runway.
Variation in arrival time is worse than consistency. If a flight is always 30 minutes late and that delay is known, then it is as if the arrival time is that delayed time. The traveler could easily plan for this. But higher variation in flight times makes it harder to plan.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
The first expression is the following.
not_cancelled %>%
count(dest)
The count() function counts the number of instances within each group of variables. Instead of using the count() function, we can combine the group_by() and summarise() verbs.
not_cancelled %>%
group_by(dest) %>%
summarise(n = length(dest))
## `summarise()` ungrouping output (override with `.groups` argument)
An alternative method for getting the number of observations in a data frame is the function n().
not_cancelled %>%
group_by(dest) %>%
summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
Another alternative to count() is to use group_by() followed by tally(). In fact, count() is effectively a short-cut for group_by() followed by tally().
not_cancelled %>%
group_by(tailnum) %>%
tally()
The second expression also uses the count() function, but adds a wt argument.
not_cancelled %>%
count(tailnum, wt = distance)
As before, we can replicate count() by combining the group_by() and summarise() verbs. But this time instead of using length(), we will use sum() with the weighting variable.
not_cancelled %>%
group_by(tailnum) %>%
summarise(n = sum(distance))
## `summarise()` ungrouping output (override with `.groups` argument)
Like the previous example, we can also use the combination group_by() and tally(). Any arguments to tally() are summed.
not_cancelled %>%
group_by(tailnum) %>%
tally(distance)
If a flight never departs, then it won’t arrive. A flight could also depart and not arrive if it crashes, or if it is redirected and lands in an airport other than its intended destination. So the most important column is arr_delay, which indicates the amount of delay in arrival.
filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
select(dep_time, arr_time, sched_arr_time, dep_delay, arr_delay)
In this data dep_time can be non-missing and arr_delay missing but arr_time not missing. Some further research found that these rows correspond to diverted flights. The BTS database that is the source for the flights table contains additional information for diverted flights that is not included in the nycflights13 data. The source contains a column DivArrDelay with the description:
Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.
One pattern in cancelled flights per day is that the number of cancelled flights increases with the total number of flights per day. The proportion of cancelled flights increases with the average delay of flights.
To answer these questions, use definition of cancelled used in the chapter Section 5.6.3 and the relationship {r}!(is.na(arr_delay) & {r} is.na(dep_delay)) is equal to {r} !is.na(arr_delay) | !is.na(dep_delay).
The first part of the question asks for any pattern in the number of cancelled flights per day. I’ll look at the relationship between the number of canceled flights per day and the total number of flights in a day. There should be an increasing relationship for two reasons. First, if all flights are equally likely to be cancelled, then days with more flights should have a higher number of cancellations. Second, it is likely that days with more flights would have a higher probability of cancellations because congestion itself can cause delays and any delay would affect more flights, and large delays can lead to cancellations.
cancelled_per_day <-
flights %>%
mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
group_by(year, month, day) %>%
summarise(
cancelled_num = sum(cancelled),
flights_num = n(),
)
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
Plotting flights_num against cancelled_num shows that the number of flights canceled increases with the total number of flights.
ggplot(cancelled_per_day) +
geom_point(aes(x = flights_num, y = cancelled_num))
cancelled_and_delays <-
flights %>%
mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
group_by(year, month, day) %>%
summarise(
cancelled_prop = mean(cancelled),
avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE)
) %>%
ungroup()
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
The second part of the question asks whether there is a relationship between the proportion of flights cancelled and the average departure delay. I implied this in my answer to the first part of the question, when I noted that increasing delays could result in increased cancellations. The question does not specify which delay, so I will show the relationship for both.
cancelled_and_delays <-
flights %>%
mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
group_by(year, month, day) %>%
summarise(
cancelled_prop = mean(cancelled),
avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE)
) %>%
ungroup()
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
There is a strong increasing relationship between both average departure delay and and average arrival delay and the proportion of cancelled flights.
ggplot(cancelled_and_delays) +
geom_point(aes(x = avg_dep_delay, y = cancelled_prop))
flights %>%group_by(carrier, dest) %>% summarise(n()) )flights %>%
group_by(carrier) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(arr_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
What airline corresponds to the “F9” carrier code?
filter(airlines, carrier == "F9")
You can get part of the way to disentangling the effects of airports versus bad carriers by comparing the average delay of each carrier to the average delay of flights within a route (flights from the same origin to the same destination). Comparing delays between carriers and within each route disentangles the effect of carriers and airports. A better analysis would compare the average delay of a carrier’s flights to the average delay of all other carrier’s flights within a route.
flights %>%
filter(!is.na(arr_delay)) %>%
# Total delay by carrier within each origin, dest
group_by(origin, dest, carrier) %>%
summarise(
arr_delay = sum(arr_delay),
flights = n()
) %>%
# Total delay within each origin dest
group_by(origin, dest) %>%
mutate(
arr_delay_total = sum(arr_delay),
flights_total = sum(flights)
) %>%
# average delay of each carrier - average delay of other carriers
ungroup() %>%
mutate(
arr_delay_others = (arr_delay_total - arr_delay) /
(flights_total - flights),
arr_delay_mean = arr_delay / flights,
arr_delay_diff = arr_delay_mean - arr_delay_others
) %>%
# remove NaN values (when there is only one carrier)
filter(is.finite(arr_delay_diff)) %>%
# average over all airports it flies to
group_by(carrier) %>%
summarise(arr_delay_diff = mean(arr_delay_diff)) %>%
arrange(desc(arr_delay_diff))
## `summarise()` regrouping output by 'origin', 'dest' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
count() do. When might you use it?
The sort argument to count() sorts the results in order of n. You could use this anytime you would run count() followed by arrange().
For example, the following expression counts the number of flights to a destination and sorts the returned data from highest to lowest.
flights %>%
count(dest, sort = TRUE)
Summary functions (mean()), offset functions (lead(), lag()), ranking functions (min_rank(), row_number()), operate within each group when used with group_by() in mutate() or filter(). Arithmetic operators (+, -), logical operators (<, ==), modular arithmetic operators (%%, %/%), logarithmic functions (log) are not affected by group_by.
Summary functions like mean(), median(), sum(), std() and others covered in the section Useful Summary Functions calculate their values within each group when used with mutate() or filter() and group_by().
tibble(x = 1:9,
group = rep(c("a", "b", "c"), each = 3)) %>%
mutate(x_mean = mean(x)) %>%
group_by(group) %>%
mutate(x_mean_2 = mean(x))
Arithmetic operators +, -, *, /, ^ are not affected by group_by().
tibble(x = 1:9,
group = rep(c("a", "b", "c"), each = 3)) %>%
mutate(y = x + 2) %>%
group_by(group) %>%
mutate(z = x + 2)
The modular arithmetic operators %/% and %% are not affected by group_by()
tibble(x = 1:9,
group = rep(c("a", "b", "c"), each = 3)) %>%
mutate(y = x %% 2) %>%
group_by(group) %>%
mutate(z = x %% 2)
The offset functions lead() and lag() respect the groupings in group_by(). The functions lag() and lead() will only return values within each group.
tibble(x = 1:9,
group = rep(c("a", "b", "c"), each = 3)) %>%
group_by(group) %>%
mutate(lag_x = lag(x),
lead_x = lead(x))
The question does not define a way to measure on-time record, so I will consider two metrics:
proportion of flights not delayed or cancelled, and mean arrival delay. The first metric is the proportion of not-cancelled and on-time flights. I use the presence of an arrival time as an indicator that a flight was not cancelled. However, there are many planes that have never flown an on-time flight. Additionally, many of the planes that have the lowest proportion of on-time flights have only flown a small number of flights.
flights %>%
filter(!is.na(tailnum)) %>%
mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
group_by(tailnum) %>%
summarise(on_time = mean(on_time), n = n()) %>%
filter(min_rank(on_time) == 1)
## `summarise()` ungrouping output (override with `.groups` argument)
So, I will remove planes that flew at least 20 flights. The choice of 20 was chosen because it round number near the first quartile of the number of flights by plane.56
quantile(count(flights, tailnum)$n)
## 0% 25% 50% 75% 100%
## 1 23 54 110 2512
The plane with the worst on time record that flew at least 20 flights is:
flights %>%
filter(!is.na(tailnum), is.na(arr_time) | !is.na(arr_delay)) %>%
mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
group_by(tailnum) %>%
summarise(on_time = mean(on_time), n = n()) %>%
filter(n >= 20) %>%
filter(min_rank(on_time) == 1)
## `summarise()` ungrouping output (override with `.groups` argument)
There are cases where arr_delay is missing but arr_time is not missing. I have not debugged the cause of this bad data, so these rows are dropped for the purposes of this exercise.
The second metric is the mean minutes delayed. As with the previous metric, I will only consider planes which flew least 20 flights. A different plane has the worst on-time record when measured as average minutes delayed.
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(tailnum) %>%
summarise(arr_delay = mean(arr_delay), n = n()) %>%
filter(n >= 20) %>%
filter(min_rank(desc(arr_delay)) == 1)
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s group by the hour of the flight. The earlier the flight is scheduled, the lower its expected delay. This is intuitive as delays will affect later flights. Morning flights have fewer (if any) previous flights that can delay them.
flights %>%
group_by(hour) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(arr_delay)
## `summarise()` ungrouping output (override with `.groups` argument)
The key to answering this question is to only include delayed flights when calculating the total delay and proportion of delay.
flights %>%
filter(arr_delay > 0) %>%
group_by(dest) %>%
mutate(
arr_delay_total = sum(arr_delay),
arr_delay_prop = arr_delay / arr_delay_total
) %>%
select(dest, month, day, dep_time, carrier, flight,
arr_delay, arr_delay_prop) %>%
arrange(dest, desc(arr_delay_prop))
There is some ambiguity in the meaning of the term flights in the question. The first example defined a flight as a row in the flights table, which is a trip by an aircraft from an airport at a particular date and time. However, flight could also refer to the flight number, which is the code a carrier uses for an airline service of a route. For example, AA1 is the flight number of the 09:00 American Airlines flight between JFK and LAX. The flight number is contained in the flights\(flight column, though what is called a “flight” is a combination of the flights\)carrier and flights$flight columns.
flights %>%
filter(arr_delay > 0) %>%
group_by(dest, origin, carrier, flight) %>%
summarise(arr_delay = sum(arr_delay)) %>%
group_by(dest) %>%
mutate(
arr_delay_prop = arr_delay / sum(arr_delay)
) %>%
arrange(dest, desc(arr_delay_prop)) %>%
select(carrier, flight, origin, dest, arr_delay_prop)
## `summarise()` regrouping output by 'dest', 'origin', 'carrier' (override with `.groups` argument)
This calculates the departure delay of the preceding flight from the same airport.
lagged_delays <- flights %>%
arrange(origin, month, day, dep_time) %>%
group_by(origin) %>%
mutate(dep_delay_lag = lag(dep_delay)) %>%
filter(!is.na(dep_delay), !is.na(dep_delay_lag))
This plots the relationship between the mean delay of a flight for all values of the previous flight. For delays less than two hours, the relationship between the delay of the preceding flight and the current flight is nearly a line. After that the relationship becomes more variable, as long-delayed flights are interspersed with flights leaving on-time. After about 8-hours, a delayed flight is likely to be followed by a flight leaving on time.
lagged_delays %>%
group_by(dep_delay_lag) %>%
summarise(dep_delay_mean = mean(dep_delay)) %>%
ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
geom_point() +
scale_x_continuous(breaks = seq(0, 1500, by = 120)) +
labs(y = "Departure Delay", x = "Previous Departure Delay")
## `summarise()` ungrouping output (override with `.groups` argument)
The overall relationship looks similar in all three origin airports.
lagged_delays %>%
group_by(origin, dep_delay_lag) %>%
summarise(dep_delay_mean = mean(dep_delay)) %>%
ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
geom_point() +
facet_wrap(~ origin, ncol=1) +
labs(y = "Departure Delay", x = "Previous Departure Delay")
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
When calculating this answer we should only compare flights within the same (origin, destination) pair.
To find unusual observations, we need to first put them on the same scale. I will standardize values by subtracting the mean from each and then dividing each by the standard deviation. s t a n d a r d i z e d ( x ) = x − m e a n ( x ) s d ( x ) .
A standardized variable is often called a
z -score. The units of the standardized variable are standard deviations from the mean. This will put the flight times from different routes on the same scale. The larger the magnitude of the standardized variable for an observation, the more unusual the observation is. Flights with negative values of the standardized variable are faster than the mean flight for that route, while those with positive values are slower than the mean flight for that route.
standardized_flights <- flights %>%
filter(!is.na(air_time)) %>%
group_by(dest, origin) %>%
mutate(
air_time_mean = mean(air_time),
air_time_sd = sd(air_time),
n = n()
) %>%
ungroup() %>%
mutate(air_time_standard = (air_time - air_time_mean) / (air_time_sd + 1))
I add 1 to the denominator and numerator to avoid dividing by zero. Note that the ungroup() here is not necessary. However, I will be using this data frame later. Through experience, I have found that I have fewer bugs when I keep a data frame grouped for only those verbs that need it. If I did not ungroup() this data frame, the arrange() used later would not work as expected. It is better to err on the side of using ungroup() when unnecessary.
The distribution of the standardized air flights has long right tail.
ggplot(standardized_flights, aes(x = air_time_standard)) +
geom_density()
## Warning: Removed 4 rows containing non-finite values (stat_density).
Unusually fast flights are those flights with the smallest standardized values.
standardized_flights %>%
arrange(air_time_standard) %>%
select(
carrier, flight, origin, dest, month, day,
air_time, air_time_mean, air_time_standard
) %>%
head(10) %>%
print(width = Inf)
## # A tibble: 10 x 9
## carrier flight origin dest month day air_time air_time_mean
## <chr> <int> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 DL 1499 LGA ATL 5 25 65 114.
## 2 EV 4667 EWR MSP 7 2 93 151.
## 3 EV 4292 EWR GSP 5 13 55 93.2
## 4 EV 3805 EWR BNA 3 23 70 115.
## 5 EV 4687 EWR CVG 9 29 62 96.1
## 6 B6 2002 JFK BUF 11 10 38 57.1
## 7 DL 1902 LGA PBI 1 12 105 146.
## 8 DL 161 JFK SEA 7 3 275 329.
## 9 EV 5486 LGA PIT 4 28 40 57.7
## 10 B6 30 JFK ROC 3 25 35 51.9
## air_time_standard
## <dbl>
## 1 -4.56
## 2 -4.46
## 3 -4.20
## 4 -3.73
## 5 -3.60
## 6 -3.38
## 7 -3.34
## 8 -3.34
## 9 -3.15
## 10 -3.10
I used width = Inf to ensure that all columns will be printed.
The fastest flight is DL1499 from LGA to ATL which departed on 2013-05-25 at 17:09. It has an air time of 65 minutes, compared to an average flight time of 114 minutes for its route. This is 4.6 standard deviations below the average flight on its route.
It is important to note that this does not necessarily imply that there was a data entry error. We should check these flights to see whether there was some reason for the difference. It may be that we are missing some piece of information that explains these unusual times.
A potential issue with the way that we standardized the flights is that the mean and standard deviation used to calculate are sensitive to outliers and outliers is what we are looking for. Instead of standardizing variables with the mean and variance, we could use the median as a measure of central tendency and the interquartile range (IQR) as a measure of spread. The median and IQR are more resistant to outliers than the mean and standard deviation. The following method uses the median and inter-quartile range, which are less sensitive to outliers.
standardized_flights2 <- flights %>%
filter(!is.na(air_time)) %>%
group_by(dest, origin) %>%
mutate(
air_time_median = median(air_time),
air_time_iqr = IQR(air_time),
n = n(),
air_time_standard = (air_time - air_time_median) / air_time_iqr)
The distribution of the standardized air flights using this new definition also has long right tail of slow flights.
ggplot(standardized_flights2, aes(x = air_time_standard)) +
geom_density()
## Warning: Removed 4 rows containing non-finite values (stat_density).
Unusually fast flights are those flights with the smallest standardized values.
standardized_flights2 %>%
arrange(air_time_standard) %>%
select(
carrier, flight, origin, dest, month, day, air_time,
air_time_median, air_time_standard
) %>%
head(10) %>%
print(width = Inf)
## # A tibble: 10 x 9
## # Groups: dest, origin [10]
## carrier flight origin dest month day air_time air_time_median
## <chr> <int> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 EV 4667 EWR MSP 7 2 93 149
## 2 DL 1499 LGA ATL 5 25 65 112
## 3 US 2132 LGA BOS 3 2 21 37
## 4 B6 30 JFK ROC 3 25 35 51
## 5 B6 2002 JFK BUF 11 10 38 57
## 6 EV 4292 EWR GSP 5 13 55 92
## 7 EV 4249 EWR SYR 3 15 30 39
## 8 EV 4580 EWR BTV 6 29 34 46
## 9 EV 3830 EWR RIC 7 2 35 53
## 10 EV 4687 EWR CVG 9 29 62 95
## air_time_standard
## <dbl>
## 1 -3.5
## 2 -3.36
## 3 -3.2
## 4 -3.2
## 5 -3.17
## 6 -3.08
## 7 -3
## 8 -3
## 9 -3
## 10 -3
All of these answers have relied only on using a distribution of comparable observations to find unusual observations. In this case, the comparable observations were flights from the same origin to the same destination. Apart from our knowledge that flights from the same origin to the same destination should have similar air times, we have not used any other domain-specific knowledge. But we know much more about this problem. The most obvious piece of knowledge we have is that we know that flights cannot travel back in time, so there should never be a flight with a negative airtime. But we also know that aircraft have maximum speeds. While different aircraft have different cruising speeds, commercial airliners typically cruise at air speeds around 547–575 mph. Calculating the ground speed of aircraft is complicated by the way in which winds, especially the influence of wind, especially jet streams, on the ground-speed of flights. A strong tailwind can increase ground-speed of the aircraft by 200 mph. Apart from the retired Concorde. For example, in 2018, a transatlantic flight traveled at 770 mph due to a strong jet stream tailwind. This means that any flight traveling at speeds greater than 800 mph is implausible, and it may be worth checking flights traveling at greater than 600 or 700 mph. Ground speed could also be used to identify aircraft flying implausibly slow. Joining flights data with the air craft type in the planes table and getting information about typical or top speeds of those aircraft could provide a more detailed way to identify implausibly fast or slow flights. Additional data on high altitude wind speeds at the time of the flight would further help.
Knowing the substance of the data analysis at hand is one of the most important tools of a data scientist. The tools of statistics are a complement, not a substitute, for that knowledge.
With that in mind, Let’s plot the distribution of the ground speed of flights. The modal flight in this data has a ground speed of between 400 and 500 mph. The distribution of ground speeds has a large left tail of slower flights below 400 mph constituting the majority. There are very few flights with a ground speed over 500 mph.
flights %>%
mutate(mph = distance / (air_time / 60)) %>%
ggplot(aes(x = mph)) +
geom_histogram(binwidth = 10)
## Warning: Removed 9430 rows containing non-finite values (stat_bin).
The fastest flight is the same one identified as the largest outlier earlier. Its ground speed was 703 mph. This is fast for a commercial jet, but not impossible.
flights %>%
mutate(mph = distance / (air_time / 60)) %>%
arrange(desc(mph)) %>%
select(mph, flight, carrier, flight, month, day, dep_time) %>%
head(5)
One explanation for unusually fast flights is that they are “making up time” in the air by flying faster. Commercial aircraft do not fly at their top speed since the airlines are also concerned about fuel consumption. But, if a flight is delayed on the ground, it may fly faster than usual in order to avoid a late arrival. So, I would expect that some of the unusually fast flights were delayed on departure.
flights %>%
mutate(mph = distance / (air_time / 60)) %>%
arrange(desc(mph)) %>%
select(
origin, dest, mph, year, month, day, dep_time, flight, carrier,
dep_delay, arr_delay
)
Five of the top ten flights had departure delays, and three of those were able to make up that time in the air and arrive ahead of schedule.
Overall, there were a few flights that seemed unusually fast, but they all fall into the realm of plausibility and likely are not data entry problems. [Ed. Please correct me if I am missing something]
The second part of the question asks us to compare flights to the fastest flight on a route to find the flights most delayed in the air. I will calculate the amount a flight is delayed in air in two ways. The first is the absolute delay, defined as the number of minutes longer than the fastest flight on that route,air_time - min(air_time). The second is the relative delay, which is the percentage increase in air time relative to the time of the fastest flight along that route, (air_time - min(air_time)) / min(air_time) * 100.
air_time_delayed <-
flights %>%
group_by(origin, dest) %>%
mutate(
air_time_min = min(air_time, na.rm = TRUE),
air_time_delay = air_time - air_time_min,
air_time_delay_pct = air_time_delay / air_time_min * 100
)
## Warning: Problem with `mutate()` input `air_time_min`.
## i no non-missing arguments to min; returning Inf
## i Input `air_time_min` is `min(air_time, na.rm = TRUE)`.
## i The error occurred in group 41: origin = "EWR", dest = "LGA".
## Warning in min(air_time, na.rm = TRUE): no non-missing arguments to min;
## returning Inf
The most delayed flight in air in minutes was DL841 from JFK to SFO which departed on 2013-07-28 at 17:27. It took 189 minutes longer than the flight with the shortest air time on its route.
air_time_delayed %>%
arrange(desc(air_time_delay)) %>%
select(
air_time_delay, carrier, flight,
origin, dest, year, month, day, dep_time,
air_time, air_time_min
) %>%
head() %>%
print(width = Inf)
## # A tibble: 6 x 11
## # Groups: origin, dest [5]
## air_time_delay carrier flight origin dest year month day dep_time air_time
## <dbl> <chr> <int> <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 189 DL 841 JFK SFO 2013 7 28 1727 490
## 2 165 DL 426 JFK LAX 2013 11 22 1812 440
## 3 163 AA 575 JFK EGE 2013 1 28 1806 382
## 4 147 DL 17 JFK LAX 2013 7 10 1814 422
## 5 145 UA 745 LGA DEN 2013 9 10 1513 331
## 6 143 UA 587 EWR LAS 2013 11 22 2142 399
## air_time_min
## <dbl>
## 1 301
## 2 275
## 3 219
## 4 275
## 5 186
## 6 256
The most delayed flight in air as a percentage of the fastest flight along that route was US2136 from LGA to BOS departing on 2013-06-17 at 16:52. It took 410% longer than the flight with the shortest air time on its route.
air_time_delayed %>%
arrange(desc(air_time_delay)) %>%
select(
air_time_delay_pct, carrier, flight,
origin, dest, year, month, day, dep_time,
air_time, air_time_min
) %>%
head() %>%
print(width = Inf)
## # A tibble: 6 x 11
## # Groups: origin, dest [5]
## air_time_delay_pct carrier flight origin dest year month day dep_time
## <dbl> <chr> <int> <chr> <chr> <int> <int> <int> <int>
## 1 62.8 DL 841 JFK SFO 2013 7 28 1727
## 2 60 DL 426 JFK LAX 2013 11 22 1812
## 3 74.4 AA 575 JFK EGE 2013 1 28 1806
## 4 53.5 DL 17 JFK LAX 2013 7 10 1814
## 5 78.0 UA 745 LGA DEN 2013 9 10 1513
## 6 55.9 UA 587 EWR LAS 2013 11 22 2142
## air_time air_time_min
## <dbl> <dbl>
## 1 490 301
## 2 440 275
## 3 382 219
## 4 422 275
## 5 331 186
## 6 399 256
To restate this question, we are asked to rank airlines by the number of destinations that they fly to, considering only those airports that are flown to by two or more airlines. There are two steps to calculating this ranking. First, find all airports serviced by two or more carriers. Then, rank carriers by the number of those destinations that they service.
flights %>%
# find all airports with > 1 carrier
group_by(dest) %>%
mutate(n_carriers = n_distinct(carrier)) %>%
filter(n_carriers > 1) %>%
# rank carriers by numer of destinations
group_by(carrier) %>%
summarize(n_dest = n_distinct(dest)) %>%
arrange(desc(n_dest))
## `summarise()` ungrouping output (override with `.groups` argument)
The carrier “EV” flies to the most destinations, considering only airports flown to by two or more carriers. What airline does the “EV” carrier code correspond to?
filter(airlines, carrier == "EV")
Unless you know the airplane industry, it is likely that you don’t recognize ExpressJet; I certainly didn’t. It is a regional airline that partners with major airlines to fly from hubs (larger airports) to smaller airports. This means that many of the shorter flights of major carriers are operated by ExpressJet. This business model explains why ExpressJet services the most destinations.
Among the airlines that fly to only one destination from New York are Alaska Airlines and Hawaiian Airlines.
filter(airlines, carrier %in% c("AS", "F9", "HA"))
The question does not specify arrival or departure delay. I consider dep_delay in this answer, though similar code could be used for arr_delay.
flights %>%
# sort in increasing order
select(tailnum, year, month,day, dep_delay) %>%
filter(!is.na(dep_delay)) %>%
arrange(tailnum, year, month, day) %>%
group_by(tailnum) %>%
# cumulative number of flights delayed over one hour
mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
# count the number of flights == 0
summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
arrange(total_flights)
## `summarise()` ungrouping output (override with `.groups` argument)